Skip to content

Conversation

@Kubik42
Copy link
Contributor

@Kubik42 Kubik42 commented Aug 15, 2025

This is a small refactor + bug for fix 131282.

The refactor changes how text, match_only_text, and annotated_text fields use keyword multi fields for synthetic source. Currently, this is done via the hasSyntheticSourceCompatibleKeywordField argument, where we set a boolean flag to indicate whether there is a keyword multi field that is either stored or has doc values. This is not a good approach for addressing 131282 because we want to disable the following logic for multi fields. With that disabled, the parent fields will no longer have a multi field to use for synthetic source.

We could designate one of the keyword fields as some kind of "synthetic source provider" for the parent. This way the field will always create a StoredField when ignore_above is tripped. However, this is a poor approach since it exposes how text fields are implemented to the keyword field. If the parent field decides how and what is stored, it'll be a lot clearer in the code.

This is where this PR comes in. It aims to remove hasSyntheticSourceCompatibleKeywordField (although kept for now for bwc) and instead relies on the syntheticSourceDelegate. With the addition of a new method canUseSyntheticSourceDelegateForSyntheticSource(), which is called during indexing, we can determine whether a particular keyword multi field is a valid supporter of synthetic source. If it isn't, then the parent field will explicitly create a StoredField for that.

Note: there are a lot of changed files, that said, most of them are just constructor changes. The actual changes are pretty limited.

@Kubik42
Copy link
Contributor Author

Kubik42 commented Aug 15, 2025

Accidentally nuked the previous PR while messing with branches. I did address all of the comments besides this one.

@Kubik42 Kubik42 force-pushed the 131282-2 branch 6 times, most recently from 1f84854 to 53ded56 Compare August 22, 2025 21:33
@Kubik42 Kubik42 marked this pull request as ready for review August 22, 2025 22:51
@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label and removed Team:StorageEngine labels Aug 22, 2025
storedFieldInBinaryFormat
isWithinMultiField,
storedFieldInBinaryFormat,
TextFieldMapper.SyntheticSourceHelper.syntheticSourceDelegate(getFieldType(), multiFields)
Copy link
Contributor Author

@Kubik42 Kubik42 Aug 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was somewhat of a bug - the syntheticSourceDelegate was missing despite MatchOnlyTextMapper using it indirectly. With this change, we're now passing the delegate directly to TextFieldMapper.

"Field [" + name() + "] of type [" + CONTENT_TYPE + "] cannot run positional queries since [_source] is disabled."
);
}
if (searchExecutionContext.isSourceSynthetic() && withinMultiField) {
Copy link
Contributor Author

@Kubik42 Kubik42 Aug 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this block of code was broken down into three smaller functions for readability:

  • parentFieldFetcher
  • delegateFieldFetcher
  • sourceFieldFetcher

@Kubik42 Kubik42 changed the title Abstracted how Text fields use Keyword fields inside of Text fields Don't store keyword multi fields when they trip ignore_above Aug 22, 2025
/**
* Returns a new {@link CompositeSyntheticFieldLoader} that merges this field loader with the given one.
*/
public CompositeSyntheticFieldLoader mergedWith(CompositeSyntheticFieldLoader other) {
Copy link
Contributor Author

@Kubik42 Kubik42 Aug 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed to merge two field loaders: one for loading values stored by the parent field (when ignore_above is tripped), and the second for loading values stored by the keyword multi field (when ignore_above isn't tripped). Since the keyword multi field already produces a CompositeFieldLoader, I'm just extending that class.

return new BlockSourceReader.BytesRefsBlockLoader(fetcher, sourceBlockLoaderLookup(blContext));
}

public boolean isIgnoreAboveSet() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Helper function for better readability

@Kubik42 Kubik42 added Team:StorageEngine and removed needs:triage Requires assignment of a team area label labels Aug 23, 2025
@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label and removed Team:StorageEngine labels Aug 23, 2025
public MatchOnlyTextFieldMapper build(MapperBuilderContext context) {
BuilderParams builderParams = builderParams(this, context);
MatchOnlyTextFieldType tft = buildFieldType(context, builderParams.multiFields());
final boolean storeSource;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this logic has been replaced by the new logic in parseCreateField()

* for synthetic source.
*/
public String originalName() {
return originalName;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this has been moved up to TextFamilyFieldType, which KeywordFieldType now extends

Comment on lines -340 to -347
private boolean normsDefault() {
if (indexCreatedVersion.onOrAfter(IndexVersions.DISABLE_NORMS_BY_DEFAULT_FOR_LOGSDB_AND_TSDB)) {
// don't enable norms by default if the index is LOGSDB or TSDB based
return indexMode != IndexMode.LOGSDB && indexMode != IndexMode.TIME_SERIES;
}
// bwc - historically, norms were enabled by default on text fields regardless of which index mode was used
return true;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling this outside of the constructor was bad practice, hence I moved it into the constructor. See the 313-320 below

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as it access fields that have been initialized the original method shouldn't be problematic. But I get what you say.

@Kubik42 Kubik42 marked this pull request as ready for review September 19, 2025 19:27
@elasticsearchmachine
Copy link
Collaborator

Hi @Kubik42, I've created a changelog YAML for you.

Copy link
Contributor

@jordan-powers jordan-powers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this looks good. I had a couple of nits, and a bwc question in the TextFieldMapper. This should also definitely be reviewed by Martijn.

Comment on lines 1080 to 1152
// _ignored_source field will contain entries for this field if it is not stored
// and there is no syntheticSourceDelegate.
// See #syntheticSourceSupport().
// But if a text field is a multi field it won't have an entry in _ignored_source.
// The parent might, but we don't have enough context here to figure this out.
// So we bail.
if (isSyntheticSource && syntheticSourceDelegate == null && parentField == null) {
return fallbackSyntheticSourceBlockLoader(blContext);
}

SourceValueFetcher fetcher = SourceValueFetcher.toString(blContext.sourcePaths(name()));
return new BlockSourceReader.BytesRefsBlockLoader(fetcher, blockReaderDisiLookup(blContext));
}

FallbackSyntheticSourceBlockLoader fallbackSyntheticSourceBlockLoader(BlockLoaderContext blContext) {
var reader = new FallbackSyntheticSourceBlockLoader.SingleValueReader<BytesRef>(null) {
@Override
public void convertValue(Object value, List<BytesRef> accumulator) {
if (value != null) {
accumulator.add(new BytesRef(value.toString()));
}
}

@Override
protected void parseNonNullValue(XContentParser parser, List<BytesRef> accumulator) throws IOException {
var text = parser.textOrNull();

if (text != null) {
accumulator.add(new BytesRef(text));
}
}

@Override
public void writeToBlock(List<BytesRef> values, BlockLoader.Builder blockBuilder) {
var bytesRefBuilder = (BlockLoader.BytesRefBuilder) blockBuilder;

for (var value : values) {
bytesRefBuilder.appendBytesRef(value);
}
}
};

return new FallbackSyntheticSourceBlockLoader(
reader,
name(),
IgnoredSourceFieldMapper.ignoredSourceFormat(blContext.indexSettings().getIndexVersionCreated())
) {
@Override
public Builder builder(BlockFactory factory, int expectedCount) {
return factory.bytesRefs(expectedCount);
}
};
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, in the old implementation, sometimes the synthetic source support would fall back to storing values in ignored_source. This is no longer the case with the new implementation.

However, we need to consider BWC. There will be some indices that have had data written before this change and thus might have data stored in ignored_source.

We still need to support loading values from ignored source in this case.

I suspect the reason the bwc tests have not caught this is because using the FallbackSyntheticSourceBlockLoader is just an optimization. If we use a BlockSourceReader, it will still work. However, this means it will be constructing the entire _source of the document from the synthetic source, then parsing it to determine the value. Better to use the FallbackSyntheticSourceBlockLoader which knows how to extract only the individual field we're interested in out of the ignored source.

@@ -1 +1 @@
initial_elasticsearch_8_18_8,8840010
Copy link
Contributor Author

@Kubik42 Kubik42 Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idk why these files are popping up, I'm pulling them directly from main. When I rebase before merging, I hope they'll disappear.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is occurring because you are out of date with upstream main. If you sync the upstream branch (ie the "Update Branch" button in Github) the diff should disappear. But you can also ignore for now, the transport version generation task is pulling in changes from upstream main.

This behavior isn't intentional; it's a bug in transport version generation, and should go away soon.

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Kubik42 - LGTM

Comment on lines -340 to -347
private boolean normsDefault() {
if (indexCreatedVersion.onOrAfter(IndexVersions.DISABLE_NORMS_BY_DEFAULT_FOR_LOGSDB_AND_TSDB)) {
// don't enable norms by default if the index is LOGSDB or TSDB based
return indexMode != IndexMode.LOGSDB && indexMode != IndexMode.TIME_SERIES;
}
// bwc - historically, norms were enabled by default on text fields regardless of which index mode was used
return true;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as it access fields that have been initialized the original method shouldn't be problematic. But I get what you say.

Copy link
Contributor

@jordan-powers jordan-powers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks Dmitry!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Avoid storing ignored source for multi-fields

5 participants